The tl;dr version: in a given box of
Lucky Charms, for each bowl of cereal eaten we estimate a decrease of
approximately 2.7 total charms per bowl on average. The weight of cereal
also appears to play a role when suitably included and for each increase
of 1g of cereal we estimate approximately 0.5 more charms on
average—with bowl held constant. This corresponds to a reduction of over
50% in charms from the 1st bowl to the last. The interaction between
bowl and weight is not statistically significant.
See this GitHub repository for all of the data, code, photos, etc.
In the early 2010’s there were a series of articles that circulated on the Internet about somebody investigating whether or not “Double Stuf” Oreos were truly double-stuffed. (They’re not.) It was a neat idea, and a substantial amount of material has been written about it in the years since, see here to get started. It evidently caused enough splash that some teachers were said to be using the experiment as an activity in their classrooms, and students locally have reported performing the experiment at their own school.
Against this backdrop, in the summer of 2023 I was eating a bowl of Lucky Charms for breakfast one morning. The box was nearly empty, and I thought to myself, “I’d almost rather throw the rest of this box away and open up a brand new box of Lucky Charms!” Now, if you’re anything like me, or millions of other people, then you love Lucky Charms, and you’ve loved them for as long as you can remember. They truly are Magically Delicious! But sitting there that summer morning with a spoon in my hand it occurred to me that the ending bowls of the box just didn’t quite seem as magical as the first bowls had been. They were missing something. (The charms, of course.) But was it my imagination? Could this effect be real? And if so, could it be measured?
I happened to be teaching an upper-division undergraduate Probability & Statistics class at the time and the four students and I were determined to find out.
After some discussion, the team decided on the following materials and methods.
The Lucky Charms were purchased from our local retailer—Wal-Mart. There was nothing special about \(n = 6\) boxes, it was simply the number of boxes a person could carry with two hands to the 6th floor of Cafaro Hall in a single trip. The kitchen scale was for measuring the weight of cereal, which the team thought might be important, and the scale could also help with data collection because we didn’t want to be overly preoccupied with sampling the exact same amount of cereal every time.
For the purposes of this experiment, a “bowl” was taken to be approximately 1 serving of cereal as recommended on the box (1 cup or 36g), even though it is ridiculous for anybody but a tiny magic leprechaun to get by on a measly 36g of Lucky Charms for breakfast. The team was not especially careful with maintaining bowl size consistency, and anything close to 1 cup was considered good enough. We were accounting for mass of cereal with the kitchen scale anyway and were shooting for a healthy range of observed weights.
Each bowl was poured, delicately, from the box into the plastic container, weighed, and then emptied onto a table surface for counting. The toasted oats were separated from the marshmallows and discarded. Next the following eight (8) charm types were recognized and their number recorded: Pink Hearts, Rainbows, Purple Horseshoes, Blue Moons, Green Clovers, Unicorns, Tasty Red Balloons, and Orange Stars.
Most of the time there were little bits and pieces of charms mixed about in the bowl; not every charm was 100% whole. To deal with this, the team attempted to classify the bit as to the type of charm (Green Clover, Blue Moon, etc.), and if the type could be determined, then that bit was counted as 1 in the respective category. If the bit was too small or nondescript for type identification then it was discarded.
Data were collected in class meetings on two separate occasions—we had other statistical ground to cover in the course after all. The students worked in pairs to pour and count the charms. I helped with the scale and recorded a hard copy of weight values as they were called out and entered into the computer. The team got into a data collection groove and by the end of the experiment all 4 students were pouring and counting independently.
The plastic container + cereal were weighed together each round, and then the weight of the container (measured at the start of the experiment) was subtracted from the observed total weight. The charms were entered into their respective columns and totaled.
Box: the box number (1 through 6)Bowl: the sequential bowl for each box (ranges from 1
to 13)Observation: the observed order of bowls across boxes
(1 to 69)Totweight: weight of each plastic container + cereal,
in gramsWeight: of cereal in grams, after subtracting the
weight of the containerHearts, Stars, etc: how many of that charm
in that bowlTotcharms: sum total of the assorted charmsHere is a look at the top of the data set (first 6 rows):
library(readxl)
Lucky <- read_excel("Lucky.xlsx")
Lucky$Box <- as.factor(Lucky$Box)
knitr::kable(head(Lucky))
| Box | Bowl | Observation | Totweight | Weight | Hearts | Stars | Horseshoes | Clovers | Moons | Unicorns | Balloons | Rainbows | Totcharms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 60.52 | 32.145 | 5 | 5 | 14 | 3 | 4 | 2 | 1 | 4 | 38 |
| 1 | 2 | 2 | 69.92 | 41.545 | 7 | 1 | 7 | 8 | 6 | 1 | 5 | 4 | 39 |
| 1 | 3 | 3 | 66.18 | 37.805 | 5 | 4 | 4 | 4 | 6 | 6 | 4 | 6 | 39 |
| 1 | 4 | 4 | 63.82 | 35.445 | 5 | 6 | 8 | 1 | 4 | 7 | 4 | 2 | 37 |
| 1 | 5 | 5 | 75.54 | 47.165 | 4 | 1 | 5 | 4 | 7 | 2 | 2 | 6 | 31 |
| 1 | 6 | 6 | 77.28 | 48.905 | 6 | 4 | 4 | 4 | 8 | 3 | 1 | 6 | 36 |
Equipped with these data we can report things like the mean observed
Weight was approximately 46.3g, the maximum number of a
particular charm in any one bowl was 15 (Pink Hearts tied with Purple
Horseshoes), and so forth. We could spend all day computing statistics
on this dataset to our Pink Heart’s content, but at the moment we are
primarily concerned with Totcharms and how it relates to
Bowl and maybe Weight to a lesser extent.
Here is a graph of Totcharms by Bowl,
colored by Box:
Lucky |> ggplot(aes(x = Bowl, y = Totcharms, color = Box)) +
geom_point(size = 3) +
labs(y = '# Charms') -> p1
p1
Here we see a clear decreasing trend in Totcharms as
Bowl increases, and the pattern is surprisingly linear.
There may be a slight curvature. The colors are difficult to pick out,
so let’s make a line plot and highlight a couple of series:
sizes <- c(2, 1, 2, 1, 1, 1)
alphas <- c(1, 0.2, 1, 0.2, 0.2, 0.2)
Lucky |> ggplot(aes(x = Bowl, y = Totcharms)) +
geom_line(aes(colour = Box, linewidth = Box, alpha = Box)) +
scale_discrete_manual("linewidth", values = sizes) +
scale_alpha_manual(values = alphas, guide = "none")
We can see a general trend downward for each series, but the path to get there varies for individual boxes. Notice how Box 3 starts high and stays high for a few bowls before dropping off smoothly until Bowl 10 after which it crashes. Look at how Box 1 starts relatively low, then begins increasing after Bowl 5, peaks at Bowl 8, then nosedives down to Bowl 12. It’s as if the charms were more concentrated near the top of Box 3, but were more clustered in a pocket near the middle of Box 1. We observe some boxes to bounce around, while other boxes drift in more of a straight line downward. Put it all together, though, and the overall trend is decreasing and linear. Note that all boxes lasted to Bowl 11, but only 2 boxes had 12 bowls, and a single box (Box 4) made it to Bowl 13.
Now let’s take a look at Totcharms versus
Weight.
Lucky |> ggplot(aes(x = Weight, y = Totcharms, color = Box)) +
geom_point(size = 3) +
labs(x = 'Weight (g)', y = '# Charms') -> p2
p2
This is a much noisier plot—as we might have guessed. We have a nice
range of weights, from a minimum under 30g up to a maximum near 70g.
Note that the lowest weights are usually associated with the last bowl
of the box: there isn’t enough cereal remaining to make a complete 36g
bowl. And notice there is one bowl that was extraordinarily heavy. The
explanation for where that could have come from isn’t obvious, but if we
dig a little deeper and plot Weight versus
Bowl we may have a hint:
Lucky |> ggplot(aes(x = Bowl, y = Weight, color = Box)) +
geom_point(size = 3) + ylim(5, 75) +
labs(y = 'Weight (g)') -> p3
p3
Now we see that the extra-heavy bowl was the last
Bowl = 12 of Box = 1. The reason behind that
particular data point has unfortunately been lost to the sands of time,
but bearing in mind that it was the first box the team had ever
finished, near the end it may have been difficult to judge how much
cereal was left, and perhaps the rest was poured into that final twelfth
bowl—I personally do the same thing all the time at breakfast time as I
near the end of a box of cereal. Looking at the points on the graph, if
that 12th bowl of almost 70g had been split into (say) two bowls of 40g
and 30g, respectively, then there would have been two boxes that made it
to 13 bowls instead of one box, and maybe the models below would have
fit the observed data slightly better. Alas! We will never know. Such is
the scientific enterprise.
While the relationship between Totcharms and
Weight is poorly linear, there is a hidden relationship
between Totcharms, Bowl, and
Weight which can best be displayed with a 3D
visualization.
library(plotly)
fig <- plot_ly(Lucky, x = ~Bowl, y = ~Weight, z = ~Totcharms, color = ~Box) |>
add_markers() |>
layout(scene = list(xaxis = list(title = 'Bowl'),
yaxis = list(title = 'Weight (g)'),
zaxis = list(title = '# Charms')),
legend=list(title=list(text='Box')))
fig
Note that the 3D plot is interactive. Go ahead, spin it around, zoom, pan—check it out. If you spin it around just right you will see that the dots lie more-or-less on a flat plane in 3D-space. This is exactly the kind of relationship we are looking for in a multiple linear regression model (we’ll get to that in a minute).
Now let’s try to quantify the linear relationship between these
variables. We will start with a simple linear regression model relating
Totcharms to Bowl.
Here is the model:
mod1 <- lm(Totcharms ~ Bowl, data = Lucky)
summary(mod1)
##
## Call:
## lm(formula = Totcharms ~ Bowl, data = Lucky)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.7629 -5.7629 -0.4327 6.2277 22.2277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.1309 2.1237 25.960 < 2e-16 ***
## Bowl -2.6698 0.2985 -8.945 4.81e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.313 on 67 degrees of freedom
## Multiple R-squared: 0.5442, Adjusted R-squared: 0.5374
## F-statistic: 80.01 on 1 and 67 DF, p-value: 4.807e-13
We see that Bowl is strongly linearly associated with
Totcharms. The slope on Bowl is approximately
\(-2.7\), in other words, for each
additional bowl of Lucky Charms eaten we estimate the average
Totcharms to decrease by 2.7 charms. Our coefficient of
determination is \(R^2 = 0.5442\), that
is, approximately 54% of the variance in Totcharms is
explained by the regression model with Bowl as a predictor.
There should be a whole discussion included here about residual analysis
which we are going to skip, but suffice it to say that the residual
plots are relatively well-behaved. Let’s check out a fitted line plot
with confidence bands for the regression line (the default):
p1 + geom_smooth(method = "lm", aes(group=1), colour="black")
That’s a nice relationship with a clear decreasing trend.
We will do the same thing for Weight, ignoring
Bowl for the time being. Here we go:
mod2 <- lm(Totcharms ~ Weight, data = Lucky)
summary(mod2)
##
## Call:
## lm(formula = Totcharms ~ Weight, data = Lucky)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.0151 -8.7745 0.6901 7.8328 24.4701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.1370 10.5650 2.095 0.0399 *
## Weight 0.3502 0.2256 1.552 0.1254
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.1 on 67 degrees of freedom
## Multiple R-squared: 0.0347, Adjusted R-squared: 0.02029
## F-statistic: 2.409 on 1 and 67 DF, p-value: 0.1254
We do not find Weight to be a very useful predictor of
Totcharms on its own, which is consistent with the
scatterplot we have already seen. We note for reference that the slope
on Weight is estimated at \(0.3502\), that is, each additional 1g of
Lucky Charms corresponds to an average Totcharms increase
of 0.35 charms. This sounds reasonable: more cereal, more charms. The
coefficient of determination is pretty bad: \(R^2 = 0.0347\), in other words,
approximately NONE% of the variance in Totcharms is
explained by the regression model with Weight as a
predictor. That’s okay; Weight was more of a supplementary
measure to help control for variability in the poured cereal amounts.
The residual analysis turns out to be not as bad as it could have been,
which is nice, and we expected at least a few problems anyway given the
extreme observations on the high and low ends of the weight range. For
the sake of completeness we will include another fitted line plot:
p2 + geom_smooth(method = "lm", aes(group=1), colour="black")
I originally planned to use the ggpubr package to put
these fitted-line plots together and try to save some space in the
discussion, but the plots were rather cramped and not very informative.
Anyway, this is what I would have done:
library(ggpubr)
ggarrange(p1 + geom_smooth(method = "lm", aes(group=1), colour="black"),
p2 + geom_smooth(method = "lm", aes(group=1), colour="black"),
align = 'h', labels=c('A', 'B'), legend = "right",
common.legend = TRUE)
Now we get to the fun part: we’ve explored the relationship between
Totcharms and Bowl and
Totcharms ~ Weight individually, but what happens if we put
them together? Let’s find out:
mod3 <- lm(Totcharms ~ Bowl + Weight, data = Lucky)
summary(mod3)
##
## Call:
## lm(formula = Totcharms ~ Bowl + Weight, data = Lucky)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8825 -5.4425 -0.9975 5.2475 26.5304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.3168 6.8655 4.853 7.78e-06 ***
## Bowl -2.7552 0.2796 -9.855 1.35e-14 ***
## Weight 0.4819 0.1452 3.318 0.00148 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.754 on 66 degrees of freedom
## Multiple R-squared: 0.6094, Adjusted R-squared: 0.5976
## F-statistic: 51.49 on 2 and 66 DF, p-value: 3.363e-14
Check it out! Now Bowl and Weight are
both strongly linearly associated with
Totcharms. The slope on Bowl is nearly
identical to what it was previously, \(-2.7\), but the estimated slope on
Weight has now increased to nearly 0.5 charms for each
additional 1g of cereal. Our (adjusted) Multiple \(R^2\) has jumped to almost 60%—this is
remarkable considering the relatively small sample size, the overall
noise level in the dataset, and perhaps some questionable design choices
(every little marshmallow bit counts as 1, etc.). In retrospect, it is
kind of amazing that the data didn’t turn out to be worse. It is not
often that real data collected by hand in the wild are so
good-natured.
We’ve got to do this one. The code is slightly more involved than the previous examples and has been omitted for brevity, but you can check it all out in this GitHub Gist. Let’s get on with the plot: